Plotting unstructured data

In the next two hours we are going to explore some of the ways unstructured data can be plotted using R. The list of plot is by no means exhaustive and it is just a short overview of possible options. A very good starting point to explore graphs remains https://r-graph-gallery.com/

Plotting OCR Results

Let’s say that you are exploring the results of OCR’d documents and you want to check where the low percentage of confidentiality of in OCR it comes from and if it cluster around certain areas Let’s see the original file:

image1 <- image_read("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/page10.jpg")
image1

ok this seems quite a well defined images let’s see the results of the OCR on it

Let’s import the OCR file

PrideAndPrejudiceCh10<- read_csv("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/OCRPrideAndPrejudice.csv")

What is the average?

 mean(PrideAndPrejudiceCh10$confidence)
## [1] 94.59075

Now let’s see where the issues are

I can do that by plotting the ocr confidence level on top of the original page

ggplot(PrideAndPrejudiceCh10, aes(MeanX, Meany, colour= confidence)) + #Colour code by confidence
  background_image(image1)+ # Select the image I want as a background 
  geom_point(shape=15, size=5, alpha=0.5)+#plot them as squared points
  coord_cartesian(xlim = c(0, 2481),ylim = c(3508, 0), 
                  expand = FALSE)+ #Use the sizes of the image to scale the graph 
  scale_colour_continuous(low="red", high="green") + #Create a continuous scale from green to red to colour-code the results
  theme_bw()# b/w background

Repeating the process with a second image

First I import the image

image2 <- image_read("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/page1.jpg")
image2

Ok this seems quite a well defined images let’s see the results of the OCR on it

Then the OCRed File

Again I am importing them from GitHub

PrideAndPrejudiceIncipit<- read_csv("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/PrideAndPrejudiceIncipit.csv")

What is the average for this image?

mean(PrideAndPrejudiceIncipit$confidence)
## [1] 42.67584

That is not good at all but can we see where the issues are through data visualisation

Now let’s see where the issues are

We can do so by yet again plotting it

ggplot(PrideAndPrejudiceIncipit, aes(MeanX, Meany, colour= confidence)) + #Colour code by confidence
  background_image(image2)+ # Select the image I want as a background 
  geom_point(shape=15, size=5, alpha=0.5)+#plot them as squared points
  coord_cartesian(xlim = c(0, 3307),ylim = c(4677, 0), 
                  expand = FALSE)+ #Use the sizes of the image to scale the graph 
  scale_colour_continuous(low="red", high="green") + #Create a continuous scale from green to red to colour-code the results
  theme_bw()# b/w background

Textual Data

This is what you can do with OCR data let’s now look at proper textual data

Before being able to plot unstructured data there is some cleaning and setting up that is always needed.

So let’s go through step by step:

Step 1 Import the data

We do so yet again from the GitHub

ScotAccount<- read_csv("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/ScotlandParishes.csv")

Step 2 Examine the data

We do that by running summary

summary(ScotAccount)
##     title               text               Type           TypeDescriptive   
##  Length:27065       Length:27065       Length:27065       Length:27065      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    RecordID             Area              Parish              Year          
##  Length:27065       Length:27065       Length:27065       Length:27065      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##     Tables         
##  Length:27065      
##  Class :character  
##  Mode  :character

This is a very large dataset

The ‘Old’Statistical Account(1791-99), under the direction of Sir John Sinclair of Ulbster, and the ‘New’ Statistical Account(1834-45) are reports of life during.

They offer uniquely rich and detailed parish reports for the whole of Scotland, covering a vast range of topics including agriculture, education, trades, religion and social customs.

https://stataccscot.edina.ac.uk/static/statacc/dist/home

Everything from changing fashions in dress to the different attitudes to smallpox inoculation and resulting high infant mortality between the north and south of Scotland

Our datasets are 27065 records corresponding to single reports from the statistical accounts about a certain Parish.

Step 3 create a corpus out of the csv file

A quanteda corpus is a special type of data structure used to represent a collection of texts. It is designed to facilitate the analysis of textual data in a quantitative manner. The quanteda package provides functions for creating, manipulating, and analyzing corpora.

If you do not specify a column the system will search for a variable named text if you do not have one you need to specify the name of column that contain the text you want to analyse.

ScotAccountCorpus<-corpus(ScotAccount)

Step 4 Preprocess, clean and extract - Theory Time

Tokenise

“Tokenize” is a term used in natural language processing (NLP) and text analysis to refer to the process of breaking down a text into individual units, which are often referred to as “tokens.” Tokens can be words, subwords, or other units of text, depending on the level of granularity desired for analysis.

Word Tokenization:

Definition: Breaking a text into individual words. Example: “The quick brown fox” would be tokenized into [“The”, “quick”, “brown”, “fox”].

Sentence Tokenization:

Definition: Breaking a text into individual sentences. Example: “This is the first sentence. And this is the second.” would be tokenized into [“This is the first sentence.”, “And this is the second.”].

Tokenization is a fundamental step in many natural language processing tasks because it helps to convert raw text into a format that can be easily processed and analyzed by algorithms. Once tokenized, the resulting units can be used for tasks such as counting word frequencies, training machine learning models, or extracting linguistic features for further analysis.

Document-Feature Matrix (dfm)

In a dfm, each row corresponds to a document, and each column corresponds to a feature (typically a term or word). The matrix entries represent the frequency of each feature in each document. The term “feature” is used more broadly than “term” because the features can include not only individual words but also phrases or other linguistic units.

Let’s do these steps on our dataset

dfmat_ScotAccountCorpus <- corpus_subset(ScotAccountCorpus) |> # which data 
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE, 
         remove_symbols=TRUE) |> # Tokenise
  tokens_select(min_nchar = 3)|> # Refine tokens by removing words shorter than 3 characters
  tokens_remove(pattern = c(stopwords('english'), "Parish")) |> # Remove common + custom stopwords
  dfm()|> # create a Document Feature Matrix
  dfm_trim(min_termfreq = 2000, verbose = FALSE) # remove all words that are recurring less than 2000 times. Verbose hide the result of this step from the console output this is just to speed up the process.  

Stop Words

Stop words are common words that are often filtered out during the preprocessing of natural language text data before analysis. These words are generally the most common words in a language and are considered to carry little or no meaningful information about the content of the text. Including stop words in text analysis can introduce noise and may not contribute much to the understanding of the underlying patterns.

Common examples of stop words in English include words like “the,” “and,” “is,” “in,” “to,” and so on. The specific list of stop words can vary depending on the context and the task at hand.

Option 1 - Word Cloud

Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud

Let’s start from a simple one

set.seed(100) # to create reproducible results when writing code
textplot_wordcloud(dfmat_ScotAccountCorpus) # Plot our wordcloud

Compare different areas

if we want to compare across specific areas of Scotland First I need to process the corpus again but I am looking only at data from Fife, Edinburgh, Haddington and Stirling.

CorpusSubset<-corpus_subset(ScotAccountCorpus, # which data 
              Area %in% c("Fife", "Edinburgh", "Haddington", "Stirling")) |> # select only some data from the Area column, only those from Fife, Edinburgh, Haddington and Stirling 
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE, 
         remove_symbols=TRUE) |> # tokenise
  tokens_select(min_nchar = 3)|> # Refine tokens by removing words shorter than 3 characters
  tokens_remove(pattern = c(stopwords('english'), "Parish", "Edinburgh", "Fife", "Haddington", "Stirling")) |> # remove English stopwords plus costum ones
  dfm() |>
  dfm_group(groups = Area) |> # create a Document Feature Matrix
  dfm_trim(min_termfreq =500, verbose = FALSE)  # remove all words that are recurring less than 500 times. Verbose hide the result of this step from the console output this is just to speed up the process.  

Let’s plot it

set.seed(100) # to create reproducible results when writing code

textplot_wordcloud(CorpusSubset,comparison = TRUE) # Comparison true would divide the cloud by the areas nb max number of cluster is 8. 

Let’s refine the colours

textplot_wordcloud(CorpusSubset,comparison = TRUE, 
                   color = c("red","purple", "green", "blue"),# Select colours
                   max_words = 100, # max number of words displayed 
                   min_size = 0.8,# min size font
                   max_size = 5)# max size font

Cut a Word Cloud

We can also be fancy and we can shape the wordcloud onto the shape of Scotland

To do so we need to use a library ggwordcloud. this works with dataframes so we need to transform our DFM into a dataframe containing the frequency of words we can do this by extracting the frequency (you will need quanteda.textstats for this as well)

tstat_freq_Stats <- textstat_frequency(dfmat_ScotAccountCorpus, n = 75) # extracting the top 75 words

Then we get the image that we want to use in this case the shape of Scotland

# First we get the image
image_url <- "https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/Scotland.png"

# Download the image to a local file
download.file(url = image_url, destfile = "scotland.png", mode = "wb")

set.seed(42)# Again this is to make it reproducible


# We can look at it 
Scotland<-image_read("scotland.png")
Scotland

# Finally we define the image path
image_path <- here::here("Scotland.png") # specify where the image is (need the here library) Need to be a Png for handling transparency

Plot it

ggplot(tstat_freq_Stats, aes(label = feature, size = frequency, colour = frequency)) +
  geom_text_wordcloud_area(
    mask = png::readPNG(image_path),
    rm_outside = TRUE  # This should remove text outside the mask area
  ) +
  scale_size_area(max_size =35) +
  scale_color_gradient(low = "darkblue", high = "blue") +
  theme_minimal()

Option 2: words Frequencies

We extracted the frequency now let’s look at other ways to plot it (i.e. what are the most recurrent words in a series of texts)

ggplot(tstat_freq_Stats, aes(x = frequency, y = reorder(feature, frequency))) +
  geom_point() + 
  labs(x = "Frequency", y = "Feature")+
  theme_bw()

If you wanted to compare the frequency of a single term across different texts, you can also use textstat_frequency, group the frequency by Area and extract the term.

Create document-level variable with Areas

First we re-tokenise but without do a dfm

toks_corpus_Scot_subset <- 
  corpus_subset(ScotAccountCorpus) |> # Ceci n'est pas une pipe...
  tokens()

Get frequency grouped by Areas

freq_grouped <- textstat_frequency(dfm(toks_corpus_Scot_subset), 
                                   groups = Area)

Filter the term “church” and Plot it

freq_church <- subset(freq_grouped, freq_grouped$feature %in% "church")  

ggplot(freq_church, aes(x = frequency, y = group)) +
  geom_point() + 
  scale_x_continuous(limits = c(0, 1750), breaks = c(seq(0, 1750, 150))) +
  labs(x = "Frequency", y = NULL,
       title = 'Frequency of "Church"')+
  theme_bw()

Option3: Lexical dispersion

For this one we need to use a different dataset cause the one we were working on is too big.

So we are using the inaugurations speaches of the USA presidents.

The dataset is a default dataset that can be used directly

First I Tokenise all speach past 1949

toks_corpus_inaugural_subset <- 
  corpus_subset(data_corpus_inaugural, Year > 1949) |>
  tokens()

Then I do what is called keyword in context

The term “keyword in context” (KWIC) refers to a method of displaying specific words or phrases in their surrounding context, typically within a larger body of text. This technique is commonly used in linguistics, information retrieval, and text analysis to better understand how certain words are used and the context in which they appear.

In a KWIC display, a specific word or phrase (the “keyword”) is centered in the middle of the display, and the text snippets or lines containing that keyword are shown in context, usually to the left and right of the keyword. This provides a quick and concise way to observe the usage patterns and surrounding words of a particular term.

kwic(toks_corpus_inaugural_subset, pattern = "american") |>
  textplot_xray()

Option 4 Plot dimensions of a large corpus

For the last bit we go back to our original dataset and we check how big the dataset is

It is always useful to extract information about the corpus we are working with Some methods for extracting information about the corpus:

# to explore this we focus on the text column of the ScotAccount
CorpusStat<-corpus(ScotAccount$text)

Some methods for extracting information about the corpus:

# Print doc in position 5 of the corpus
summary(CorpusStat, 5)
## Corpus consisting of 27065 documents, showing 5 documents:
## 
##   Text Types Tokens Sentences
##  text1    94    171         1
##  text2   167    399         1
##  text3   122    201         1
##  text4   123    262         1
##  text5   146    273         1
# Check how many docs are in the corpus
ndoc(CorpusStat) 
## [1] 27065
# Check number of characters in the first 10 documents of the corpus
nchar(CorpusStat[1:10]) 
##  text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 
##    940   1906   1011   1245   1354   1422   1995   1916   1855   1977
# Check number of tokens in the first 10 documents
ntoken(CorpusStat[1:10]) 
##  text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 
##    171    399    201    262    273    281    422    387    365    454

Can we create some better visualisation to look into it?

Yes of course.

Step 1

Create a new vector with tokens for all articles and store the vector as a new data frame with three columns (Ntoken, Dataset, Date).

NtokenStats<-as.vector(ntoken(CorpusStat))
TokenScotland <-data.frame(Tokens=NtokenStats, title=ScotAccount$title, Area=ScotAccount$Area, Parish=ScotAccount$Parish)

Step 2

Now we want to see how much material we have for each area. We can do that through pipes

BreakoutScotland<- TokenScotland |>
  group_by(Area)|>
  summarize(NReports=n(), MeanTokens=round(mean(Tokens)))

Step 3

Now we can plot the trends.

This is done through the use of the ggplot package that is a very handy package that will allow you to print a very big variety of graphs

ggplot(BreakoutScotland, aes(x=Area, y=NReports))+ # Select data set and coordinates we are going to plot
  geom_point(aes(size=MeanTokens, fill=MeanTokens),shape=21, stroke=1.5, alpha=0.9, colour="black")+ # Which graph I want 
  labs(x = "Areas", y = "Number of Reports", fill = "Mean of Tokens", size="Mean of Tokens", title="Number of Reports and Tokens in the Scotland Archive")+ # Rename labs and title
  scale_size_continuous(range = c(5, 15))+ # Resize the dots to be bigger
  geom_text(aes(label=MeanTokens))+ # Add the mean of tokens in the dots
  scale_fill_viridis_c(option = "plasma")+ # Change the colour coding
  theme_bw()+ # B/W Background
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "bottom")+ # Rotate labels of x and move them slightly down. Plus move the position to the bottom 
  guides(size = "none") # Remove the Size from the Legend 

The End